48 research outputs found
Ordonnancement de threads OpenMP et placement de données coordonnés sur architectures hiérarchiques
Exploiting the potential of hierarchical multiprocessor machines requires a precise distribution of threads and data over the underlying non-uniform architecture in order to avoid memory access penalties. Directive-based languages such as OpenMP give programmers a simple way to structure the parallelism of their applications and to pass this information to the runtime system. Our runtime system, based on a multi-level thread scheduler combined with a memory manager designed specifically for NUMA architectures, converts this information into scheduler hints that respect the affinities between threads and data. It provides a dynamic distribution of the workload guided by the structure of the application and the topology of the target machine, with the aim of achieving performance portability. Early experiments show that a mixed approach, combining thread movement and data migration, performs better than data distribution policies based on next-touch, which suggests that further optimizations are possible.
Exécution structurée d'applications OpenMP à grain fin sur architectures multicoeurs
Contemporary multiprocessor architectures, which naturally reflect the current evolution of microprocessors toward massively multicore chips, exhibit increasingly hierarchical parallelism. To approach the theoretical performance of these machines, one must now extract ever finer-grained parallelism from applications and, above all, communicate its structure (and, if possible, scheduling directives) to the underlying runtime system. In this article, we explain why OpenMP is an excellent vehicle for extracting massive, structured, and annotated parallelism from applications, and we show how an extension of the GNU OpenMP compiler that relies on a NUMA-aware thread scheduler makes it possible to execute dynamic and irregular applications efficiently while preserving thread and data affinity.
An Efficient OpenMP Runtime System for Hierarchical Arch
Exploiting the full computational power of ever deeper hierarchical multiprocessor machines requires a very careful distribution of threads and data over the underlying non-uniform architecture. The emergence of multi-core chips and NUMA machines makes it important to minimize the number of remote memory accesses, to favor cache affinities, and to guarantee fast completion of synchronization steps. By using the BubbleSched platform as a threading backend for the GOMP OpenMP compiler, we are able to easily transpose affinities of thread teams into scheduling hints using abstractions called bubbles. We then propose a scheduling strategy suited to nested OpenMP parallelism. The resulting preliminary performance evaluations show a significant improvement of the speedup on a typical NAS OpenMP benchmark application.
De l'exécution structurée de programmes OpenMP sur architectures hiérarchiques
This document illustrates the integration of the Marcel bubble structure into the GNU OpenMP compiler to express affinity relations between OpenMP teammates. It also introduces Affinity, a bubble scheduler dedicated to OpenMP and designed to distribute these bubbles over hierarchical architectures while favouring affinity relations.
Design methodology for workload-aware loop scheduling strategies based on genetic algorithm and simulation
In high-performance computing, the application's workload must be evenly balanced among threads to deliver cutting-edge performance and scalability. In OpenMP, the load balancing problem arises when scheduling loop iterations to threads. In this context, several scheduling strategies have been proposed, but they do not take into account the input workload of the application and thus turn out to be suboptimal. In this work, we introduce a design methodology to propose, study, and assess the performance of workload-aware loop scheduling strategies. In this methodology, a genetic algorithm is employed to explore the solution space of the problem itself and to guide the design of new loop scheduling strategies, and a simulator is used to evaluate their performance. As a proof of concept, we show how the proposed methodology was used to propose and study a new workload-aware loop scheduling strategy named smart round-robin (SRR). We implemented this strategy in the GNU Compiler Collection's OpenMP runtime. We carry out several experiments to validate the simulator and to evaluate the performance of SRR. Our experimental results show that SRR may deliver up to 37.89% and 14.10% better performance than OpenMP's dynamic loop scheduling strategy in the simulated environment and in a real-world application kernel, respectively.
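The load balancing problem described in this abstract can be illustrated with a small simulation. The sketch below is not SRR and not the authors' simulator; it only compares two generic policies on a list of per-iteration costs, assuming zero scheduling overhead. The function names `static_makespan` and `dynamic_makespan` are hypothetical, and the static policy only approximates what `schedule(static)` does, since the exact chunking is implementation-defined.

```python
import heapq

def static_makespan(costs, n_threads):
    """Contiguous block partition, roughly like schedule(static)."""
    base, extra = divmod(len(costs), n_threads)
    finish_times = []
    start = 0
    for t in range(n_threads):
        # The first `extra` threads get one additional iteration.
        size = base + (1 if t < extra else 0)
        finish_times.append(sum(costs[start:start + size]))
        start += size
    return max(finish_times)

def dynamic_makespan(costs, n_threads, chunk=1):
    """Each idle thread grabs the next chunk, roughly like schedule(dynamic, chunk)."""
    # Min-heap holding the time at which each thread becomes free.
    free_at = [0] * n_threads
    heapq.heapify(free_at)
    for i in range(0, len(costs), chunk):
        t = heapq.heappop(free_at)          # earliest available thread
        heapq.heappush(free_at, t + sum(costs[i:i + chunk]))
    return max(free_at)

# Iteration costs that grow with the loop index (an unbalanced workload).
costs = [1, 2, 3, 4, 5, 6, 7, 8]
print(static_makespan(costs, 2))   # block partition: second thread gets the heavy half
print(dynamic_makespan(costs, 2))  # greedy assignment balances better here
```

With these costs the block partition leaves one thread with most of the work, while the greedy policy gets closer to the ideal of half the total cost per thread. A workload-aware strategy would use knowledge of `costs` before assigning iterations, which neither policy above does.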
Description, Implementation and Evaluation of an Affinity Clause for Task Directives
OpenMP 4.0 introduced dependent tasks, which give the programmer a way to express fine-grained parallelism. Using appropriate OS support (such as NUMA libraries), the runtime can rely on the information in the depend clause to dynamically map the tasks to the architecture topology. Controlling data locality is one of the key factors to reach a high level of performance when targeting NUMA architectures. On this topic, OpenMP does not yet provide much flexibility to the programmer, which leaves the runtime to decide where a task should be executed. In this paper, we present a class of applications which would benefit from having such control and flexibility over task and data placement. We also propose our own interpretation of the new affinity clause for the task directive, which is being discussed by the OpenMP Architecture Review Board. This clause enables the programmer to give hints to the runtime about task placement during the program execution, which can be used to control the data mapping on the architecture. In our proposal, the programmer can express affinity between a task and the following resources: a thread, a NUMA node, and a data item. We then present an implementation of this proposal in the Clang-3.8 compiler, and an implementation of the corresponding extensions in our OpenMP runtime LIBKOMP. Finally, we present a preliminary evaluation of this work running two task-based OpenMP kernels on a 192-core NUMA architecture, which shows noticeable improvements both in terms of performance and scalability.
On the Performance and Isolation of Asymmetric Microkernel Design for Lightweight Manycores
Evaluation of OpenMP Dependent Tasks with the KASTORS Benchmark Suite
The recent introduction of task dependencies in the OpenMP specification provides new ways of synchronizing tasks. Application programmers can now describe the data a task will read as input and write as output, letting the runtime system resolve fine-grain dependencies between tasks to decide which task should execute next. Such an approach should scale better than the excessive global synchronization found in most OpenMP 3.0 applications. As promising as it looks, however, any new feature needs proper evaluation to encourage application programmers to embrace it. This paper introduces the KASTORS benchmark suite designed to evaluate OpenMP task dependencies. We modified state-of-the-art OpenMP 3.0 benchmarks and data-flow parallel linear algebra kernels to make use of task dependencies. Learning from this experience, we propose extensions to the current OpenMP specification to improve the expressiveness of dependencies. Finally, we evaluate both the GCC/libGOMP and the CLANG/libIOMP implementations of OpenMP 4.0 on our KASTORS suite, demonstrating the interest of task dependencies compared to taskwait-based approaches.
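How a runtime could derive an execution order from declared inputs and outputs can be sketched in a few lines. This is a simplified model, not the libGOMP or libIOMP implementation: the function `dependency_waves` and the task names are hypothetical, and tasks are grouped into level-by-level "waves" for readability, whereas a real runtime starts each task as soon as its own predecessors have finished.

```python
def dependency_waves(tasks):
    """Group tasks into waves from their declared reads and writes.

    tasks: list of (name, reads, writes) tuples in submission order.
    Tasks in the same wave do not depend on one another.
    """
    last_writer = {}  # datum -> index of the last task that wrote it
    readers = {}      # datum -> indices of tasks that read it since that write
    level = []        # level[i] = wave number of task i
    for idx, (_name, reads, writes) in enumerate(tasks):
        deps = set()
        for d in reads:
            if d in last_writer:
                deps.add(last_writer[d])        # read-after-write
        for d in writes:
            if d in last_writer:
                deps.add(last_writer[d])        # write-after-write
            deps.update(readers.get(d, []))     # write-after-read
        level.append(1 + max((level[j] for j in deps), default=-1))
        # Record this task's accesses for the tasks submitted after it.
        for d in reads:
            readers.setdefault(d, []).append(idx)
        for d in writes:
            last_writer[d] = idx
            readers[d] = []
    waves = [[] for _ in range(max(level, default=-1) + 1)]
    for (name, _reads, _writes), lv in zip(tasks, level):
        waves[lv].append(name)
    return waves

tasks = [
    ("init_a",  [],         ["a"]),
    ("init_b",  [],         ["b"]),
    ("sum",     ["a", "b"], ["c"]),
    ("scale_a", ["a"],      ["a"]),  # reads and writes a, like an inout dependence
    ("print_c", ["c"],      []),
]
print(dependency_waves(tasks))
```

In this example `scale_a` and `print_c` only wait for `sum`, not for each other. A taskwait placed after every step would instead serialize them behind a global synchronization point, which is the contrast the abstract draws.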
Phase-TA: Periodicity Detection and Characterization for HPC Applications
The world of High-Performance Computing (HPC) currently stands on the edge of the ExaScale. Supercomputers are growing ever more powerful, requiring power-efficient components and ever smarter tool suites to operate them. One of the key features of those frameworks will be their ability to monitor and predict the behavior of executed applications to optimize resource utilization and abide by the operating constraints, notably on power consumption. In this context, this article presents Phase-TA, an offline tool which detects and characterizes the inherent periodicities of iterative HPC applications, with no prior knowledge of the latter. To do so, it analyzes the evolution of several performance counters at the scale of the compute node, and infers patterns representing the identified periodicities. As a result, Phase-TA offers a nonintrusive means to gain insights into the processor use associated with an application, and paves the way to predicting its behavior. Phase-TA was tested on a panel of 3 applications and benchmarks from the supercomputing field: HPCG, NEMO, and OpenFoam. For all of them, periodicities, accountable for on average 78% of their execution time, were detected and represented by accurate patterns. Furthermore, it was demonstrated that there is no need to analyze the whole profile of an application to precisely characterize its periodic behaviors. Indeed, an extract of the aforementioned profile is enough for Phase-TA to infer representative patterns on-the-fly, opening the way to energy-efficiency optimization through Dynamic Voltage-Frequency Scaling (DVFS).
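The abstract does not say how Phase-TA detects periodicities, so the sketch below swaps in a standard technique, autocorrelation, purely to illustrate what detecting a period in a performance-counter trace can look like. The function names and the sample trace are hypothetical, and the approach assumes a single, evenly sampled counter.

```python
def autocorrelation(series, lag):
    """Normalized autocorrelation of `series` at the given lag (biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        return 0.0  # a constant signal has no periodic structure to score
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var

def dominant_period(series, min_lag=2, max_lag=None):
    """Return (lag, score) for the lag with the highest autocorrelation.

    Returns (None, 0.0) when no lag has a positive score.
    """
    if max_lag is None:
        max_lag = len(series) // 2
    best_lag, best_score = None, 0.0
    for lag in range(min_lag, max_lag + 1):
        score = autocorrelation(series, lag)
        if score > best_score:  # strict: ties keep the shortest lag
            best_lag, best_score = lag, score
    return best_lag, best_score

# Made-up counter samples repeating every 4 measurements.
trace = [1, 5, 2, 0] * 6
period, score = dominant_period(trace)
print(period, round(score, 3))
```

Because the estimator divides by the variance of the full series, multiples of the true period score lower than the period itself, so the shortest repeating lag is preferred. Running the same function on a prefix of the trace gives a feel for the abstract's remark that an extract of the profile can be enough.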